-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add collapseHz()
#307
Add collapseHz()
#307
Conversation
This is a great idea, and perfect extension to the GHL foundation of ideas and methods. I'm surprised that we never included it before. From the texture class example above, it isn't clear to me how a dominant condition can be applied, when there is a (I think) full mapping of new labels to REGEX-matched patterns. Does the dominant condition come into play when an unmatched horizon label is encountered? Good call on excluding color, perhaps a reasonable place to use simulated mixtures. Or, it could be that someone is only interested in Munsell value, and in that case a wt. mean is probably fine. |
Yeah, agreed, its something I feel like we have talked a lot about but just never have implemented a generic function for.
Dominant condition is based on the thickest horizon's value within each "adjacency" group. There is no aggregation applied to an unmatched horizon label--it just passes through to the result. As currently implemented this is modifying the I need to ponder a bit on this as far as what the right ergonomics are. Probably better to have the function calculate a new horizon designation column and set it as the generalized horizon label, or horizon designation label, or both? Rather than modifying existing data--which might be important for downstream context.
Yeah, I think I want to keep the function relatively "dumb" but specific columns could be targeted with specific methods by either calling |
I must note that this concept is implemented in part as Line 275 in fcad3e8
I forgot about that function and reinvented part of the wheel here, I suppose. It is nice that
|
A note for later: one of the simplest cases is flattening on an intended horizon index, such as chkey or phiid, with a short-circut for 99% of the cases where no flattening is required. |
- implement ignoring specific numeric columns with `ignore_numerics - add argument`AGGFUN` for column name specific aggregations - default for categories is to returndominant condition rather than just thickest layer
…per profile, add example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great addition. I'd like to add one more test since there is quite a lot of complexity here. GHL assignment → collapsing horizons is going to be really useful.
R/collapseHz.R
Outdated
h <- h[order(h[[idn]], h[[hzd[1]]]),] | ||
|
||
# replace horizons in parent SPC | ||
replaceHorizons(x) <- h |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we retain original horizon IDs and set new ones? This level of accounting could be useful, but adds clutter. I'd vote for new horizon IDs and dump the old ones for simplicity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we retain original horizon IDs and set new ones? This level of accounting could be useful, but adds clutter.
I don't think it is a good idea to dump the horizon IDs and calculate new ones. This could create unintended conflicts with the auto-calculated sequential integer IDs.
I think it is not desirable for the calculated hzID
nor one set by the user e.g. chkey
or chiid
. Essentially I want to retain reference to the source horizon. I don't see an issue with current behavior that keeps a subset of the parent hzidcolumn()
values. Can you elaborate on what you are thinking here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I missed something: when two horizons are collapsed what is the resulting horizon ID?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The horizon ID (character data type) should be from the thickest horizon in the group. It is not handled differently than any other horizon-level variable by default, but you can supply a custom aggregation function (to e.g. concatenate IDs) if desired
Thanks. What else would you like to test? |
Some degenerate cases, and an additional hand-verified example. I can add those after it is merged. |
I've tested this on some |
Testing out PR, and I see that many numeric columns are converted to character: data("jacobs2000", package = "aqp")
x <- jacobs2000
a_pattern <- c(`A` = "^A",
`E` = "E",
`Bt` = "[B]+t",
`Bh` = "[B]+h",
`C` = "^C",
`foo` = "bar")
m <- collapseHz(jacobs2000, pattern = a_pattern)
profile_id(m) <- paste0(profile_id(m), "_c")
str(horizons(m))
This only happens sometimes: # no aggregation required, data are left as-is
m <- collapseHz(jacobs2000, by = 'name')
# numeric data converted to character
m <- collapseHz(jacobs2000, by = 'matrix_color_munsell') |
Additional tests I would like to add:
|
Good catch!! Fixed in 489d2a8 There were two things: a typo/not replaced variable name that changed last night, which caused the numeric aggregation to not work at all, and also there was some implicit conversion to matrix going on (where first column of matrix was character so all others were) that I also introduced last night. The reason it only happens sometimes is the "aggregation" code was not triggered if all of the labels are unique within profiles, essentially the "short circuit" you refer to here: #307 (comment) |
Thanks, I added tests and checks for averages and dominant condition with and without NA, and missing values in With the dominant condition aggregation, rather than thickest, I think I would rather not implement tests of horizon logic within this function, or at least have the option to run it without checking. I think the aggregation will generally work "correctly" even with missing depths, overlaps, gaps, though the results may not be sensible. I added some tests for the a couple common cases (filled profiles, 0 profile SPC, One might use something like this function to fix certain types of logic errors by crafting either a pattern or column indicating things that should be combined. |
80dc40f adds a basic screening function that just checks for any NA top or bottom depths and for bottom depths shallower than top. A warning directs the user to use |
Looks good to me, merge away! |
Here is another draft idea for a common operation people want to perform on SoilProfileCollections. This is a minimal implementation but I think several other options could greatly enhance the ways it could be used. It is likely it could be used to resolve some issues in soilDB such as ncss-tech/soilDB#120, ncss-tech/soilDB#122, ncss-tech/soilDB#135
collapseHz()
function makes it easy to aggregate adjacent horizons (collapsing them into a single horizon) using groups based on pattern matching. Numeric properties are aggregated via weighted average, and other properties are aggregated via dominant condition (i.e. the aggregate value is the value with greatest cumulative thickness in each group)TODO/additional ideas
grepl()
based, likethicknessOf()
/ AddthicknessOf()
#306)GHL()
)--in lieu of specifiedpattern
and possibly change default forhzdesgn
when GHL() is set?Example behavior